Targeted Gene Metagenomic Data Analysis ◾ 281
com”, Silva (16S/18S rRNA) at https://www.arb-silva.de, and UNITE (fungal ITS) at
“https://unite.ut.ee/”. The database must be downloaded and then imported into QIIME2 as an
artifact before it can be used for clustering. For example, we will download the latest release
of curated OTUs from the GreenGenes database:
wget ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz
tar -xvzf gg_13_8_otus.tar.gz
rm gg_13_8_otus.tar.gz
Make sure that the URL is a single line with no white space. You can visit the website to
download the latest release.
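An interrupted download can leave a truncated archive behind. As a minimal sketch, the archive can be tested before it is extracted and deleted. The function name check_archive is our own illustration and is not part of GreenGenes or QIIME2; it assumes only the standard tar utility.

```shell
# check_archive is a hypothetical helper name, not a QIIME2 or GreenGenes command.
# It returns 0 when the file exists and lists cleanly as a gzip-compressed tar archive.
check_archive() {
    archive="$1"
    if [ ! -f "$archive" ]; then
        echo "missing: $archive" >&2
        return 1
    fi
    # "tar -tzf" lists the archive contents without extracting them;
    # a truncated or corrupted download makes the listing fail.
    if tar -tzf "$archive" > /dev/null 2>&1; then
        echo "ok: $archive"
    else
        echo "corrupt: $archive" >&2
        return 1
    fi
}
```

Calling check_archive gg_13_8_otus.tar.gz right after the wget step, and extracting only if it succeeds, avoids working with a partial database.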
The files will be extracted into a directory (gg_13_8_otus). Display the contents of this
directory and its subdirectories using the Linux “ls” command. You will find five
subdirectories: “otus” (reference OTUs), “rep_set” (reference representative sequences),
“rep_set_aligned” (aligned representative sequences), “taxonomy” (taxonomy files), and
“trees” (phylogenetic trees). The files in these directories contain data clustered at
different identity levels (e.g., 99%, 97%, and 94%). Keep this database, as we will use it for
other applications.
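To see which files are available at a given identity level, the extracted directory can be searched by file name. The sketch below assumes that file names start with the identity level followed by an underscore, as in “rep_set/97_otus.fasta”; the function name files_at_identity is hypothetical and serves only as an illustration.

```shell
# files_at_identity is a hypothetical helper, not part of any tool.
# Assumption: database files are named "<identity>_...", e.g. 97_otus.fasta.
files_at_identity() {
    base="$1"       # top-level database directory, e.g. gg_13_8_otus
    identity="$2"   # identity level as a number, e.g. 97
    # Search all subdirectories for regular files whose names begin with the level.
    find "$base" -type f -name "${identity}_*" | sort
}
```

For example, files_at_identity gg_13_8_otus 97 would list the 97% files found under each subdirectory, if the naming assumption holds for your release.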
To use the reference database for clustering, you need to import the FASTA file of the
database representative sequences. First choose the identity at which you wish to perform
clustering. For example, to cluster your sample sequences at 97% identity, import
“rep_set/97_otus.fasta” into QIIME2 as an artifact using “tools import”. To keep the files
organized, we will create the subdirectory “closed_ref_cl_97” for closed-reference
clustering files.
mkdir closed_ref_cl_97
Then, import the database representative sequences as a QIIME2 artifact.
qiime tools import \
--type 'FeatureData[Sequence]' \
--input-path gg_13_8_otus/rep_set/97_otus.fasta \
--output-path inputs/97_otus-GG_db.qza
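Before relying on the imported reference, it can be useful to confirm that the FASTA file actually contains sequences. The following is a minimal sketch that counts FASTA records by their header lines; count_fasta_records is our own helper name, not a QIIME2 command.

```shell
# count_fasta_records is a hypothetical helper, not part of QIIME2.
# It prints the number of records in a FASTA file, or fails if the file is empty or missing.
count_fasta_records() {
    fasta="$1"
    if [ ! -s "$fasta" ]; then
        echo "empty or missing: $fasta" >&2
        return 1
    fi
    # Each FASTA record begins with a header line that starts with ">".
    grep -c '^>' "$fasta"
}
```

For example, count_fasta_records gg_13_8_otus/rep_set/97_otus.fasta prints the number of reference sequences. The artifact itself can be inspected with “qiime tools peek inputs/97_otus-GG_db.qza”, which reports its UUID, type, and data format.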
Then, you can use the “cluster-features-closed-reference” method of the “q2-vsearch” plugin
to perform closed-reference clustering on the features generated in the dereplication
steps. The input artifacts are the dereplicated feature table “derep-yoga-table.qza”, the
dereplicated representative sequences “derep-yoga-seqs.qza”, and the reference
representative sequences from the database “97_otus-GG_db.qza”.
qiime vsearch cluster-features-closed-reference \
--i-table inputs/derep-yoga-table.qza \
--i-sequences inputs/derep-yoga-seqs.qza \